Multiple Language Gender Identification for Blog Posts

Authors

  • Juan Soler
  • Leo Wanner
Abstract

In data-driven gender identification, it has so far been largely assumed that the same types of (mostly content-oriented) data features can be used to differentiate between male and female authors. In most cases, this distinction is made in a monolingual scenario. In this work, we discuss a set of features that distinguish between genders in six datasets of blog data in English, Spanish, French, German, Italian and Catalan, with accuracies that range from 77% to 88%. Using a reduced set of language-independent structural features in a multilingual scenario, we first identify the gender and then both the gender and the language of the author, achieving accuracies higher than 74%.

Similar resources

ICTNET at Blog Track TREC

This paper describes our participation in the blog track of TREC 2009. Runs were submitted for both tasks, namely the top stories identification task and the faceted blog distillation task. The “FirteX” platform was used to index and retrieve posts. For the top stories identification task, to identify important headlines, we measure the importance of a headline by accumulating the BM25 relevance score w...

Automatic Estimation of Bloggers' Gender

We propose an approach employing a Support Vector Machine (SVM) to estimate bloggers’ gender from blog posts. The data we analyze consist of blog posts on Doblog (a Japanese blog-hosting service) and questionnaire results from Doblog users. Experimental evaluations show that our approach achieved 90% accuracy for 83% of bloggers.

Predicting gender from blog posts

Blogs are informal, personal writings that people post on their own blog sites. Nowadays, blogging is an important online activity. People share blogs with their friends and family members. Blog topics cover almost everything, from personal life, political opinions, recipes, and product reviews to random rants. Although some bloggers review their biologically infor...

Identifying Facets in Query-Biased Sets of Blog Posts

We investigate the identification of facets of query-biased sets of blog posts. Given a set of blog posts relevant to a topic, we compare several methods for identifying facets of the topic in this set. Building on a clustering of a set of blog posts, we compare several cluster labeling methods, and find that a method that makes use of blog- and blog-search-specific features outperforms other me...

Publication date: 2015